Using a Chunk-based Dependency Parser to Mine Compound Words from Tweets
Abstract
New words appear every day in online communication applications such as Twitter. Twitter is the world's most famous online social networking and microblogging service, which enables its users to send and read text-based messages of up to 140 characters, known as "tweets". Because tweets are typed online (as fast as possible) within a limited number of characters, they are full of hand-made abbreviations and informal words. These facts distinguish tweets from the texts commonly found in regular web pages, such as news and blogs. Consequently, traditional hand-made corpora (in domains such as news) for natural language processing tasks such as word segmentation, part-of-speech (POS) tagging, and parsing need to be "domain adapted" to suit tweets well. That is, if a new Japanese (compound) word is not successfully recognized by a word segmentation toolkit, we can hardly ensure that the word is well covered by a Japanese Input Method Editor (IME) or well translated by a statistical machine translation system. In this paper, we focus on detecting novel compound words in Japanese tweets. We propose a method for mining contiguous compound words from single/double bunsetsus generated by a state-of-the-art chunk-based dependency parser, Cabocha (Kudo and Matsumoto, 2002), which makes use of MeCab with the IPA dictionary for Japanese word segmentation and POS tagging, and performs bunsetsu chunking and dependency parsing on top of that output.
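To make the mining step concrete, the following is a minimal sketch of extracting single- and double-bunsetsu compound-word candidates with the CaboCha Python binding on top of MeCab and the IPA dictionary. It is an illustration under stated assumptions rather than the paper's exact procedure: the accessor names used (Parser, parse, chunk_size, chunk, token, token_pos, token_size, link, surface, feature) follow the SWIG-generated binding and may differ across versions, and the noun-based filter is a simplified heuristic, not the paper's candidate criteria.

import CaboCha

parser = CaboCha.Parser()

def compound_candidates(sentence):
    """Collect compound-word candidates from single bunsetsus and from pairs of
    adjacent bunsetsus in which the first chunk depends on the next one."""
    tree = parser.parse(sentence)
    chunks = []
    for i in range(tree.chunk_size()):
        chunk = tree.chunk(i)
        tokens = [tree.token(chunk.token_pos + j) for j in range(chunk.token_size)]
        # With the IPA dictionary, the first comma-separated feature field is the coarse POS.
        has_noun = any(t.feature.split(",")[0] == "名詞" for t in tokens)
        # Note: the whole chunk surface is kept here for brevity; a real miner
        # would strip particles and other functional tokens first.
        chunks.append({"surface": "".join(t.surface for t in tokens),
                       "has_noun": has_noun,
                       "link": chunk.link})

    candidates = []
    for i, c in enumerate(chunks):
        if c["has_noun"]:
            candidates.append(c["surface"])                    # single-bunsetsu candidate
        if c["link"] == i + 1 and chunks[i + 1]["has_noun"]:   # contiguous double-bunsetsu pair
            candidates.append(c["surface"] + chunks[i + 1]["surface"])
    return candidates

if __name__ == "__main__":
    for candidate in compound_candidates("自然言語処理の研究を始めた。"):
        print(candidate)

The double-bunsetsu rule keeps only pairs where a chunk depends on its immediate right neighbor, so the concatenated surface string stays contiguous in the original tweet.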
Similar resources
Mining Japanese Compound Words and Their Pronunciations from Web Pages and Tweets
Mining compound words and their pronunciations is essential for Japanese input method editors (IMEs). We propose to use a chunk-based dependency parser to mine new words, collocations, and predicate-argument phrases from large-scale Japanese Web pages and tweets. The pronunciations of the compound words are automatically rewritten by a statistical machine translation (SMT) model. Experiments on a...
A Three Stage Hybrid Parser for Hindi
The present paper describes a three-stage technique to parse Hindi sentences. In the first stage, we create a model with the features of the head words of each chunk and their dependency relations; here, the dependency relations are inter-chunk dependency relations. We have experimentally fixed a feature set for learning this model. In the second stage, we extract the intra-chunk dependency relation...
Feature Engineering in Persian Dependency Parser
The dependency parser is one of the most important fundamental tools in natural language processing; it extracts the structure of sentences and determines the relations between words based on the dependency grammar. Dependency parsing is well suited to free-word-order languages such as Persian. In this paper, a data-driven dependency parser has been developed with the help of a phrase-structure parser fo...
A Hybrid Dependency Parser for Bangla
In this paper we describe a two-stage dependency parser for Bangla. In the first stage, we build a model using a Bangla dependency treebank, and this model is subsequently used to build a data-driven Bangla parser. In the second stage, constraint-based parsing has been used to modify the output of the data-driven parser. This second-stage module implements the Bangla-specific constraints with th...
A Data-Driven Dependency Parser for Urdu
One of the main motivations for building treebanks is that they facilitate the development of syntactic parsers by providing realistic data for evaluation as well as inductive learning. In this paper we present what we believe to be the first data-driven dependency parser for Urdu, which has been developed using the MaltParser system and trained and evaluated on data from the Urdu dependency treebank. A 40...
Publication date: 2013